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Summary 


Given  two  time  series  X  and  Y,  their  mutual  information,  I(X,Y)=I(Y,X),  is  the  average  number 
of  bits  of  X  that  can  be  predicted  by  measuring  Y  and  vice  versa.  In  the  analysis  of  observational  data, 
calculation  of  mutual  information  occurs  in  three  contexts,  identification  of  nonlinear  correlation, 
determination  of  an  optimal  sampling  interval  particularly  when  embedding  data,  and  in  the  investigation 
of  causal  relationships  with  directed  mutual  information. 

In  this  report  a  minimum  description  length  argument  is  used  to  determine  the  optimal  number  of 
elements  to  use  when  characterizing  the  distributions  of  X  and  Y.  However,  even  when  using  partitions  of 
the  X  and  Y  axis  indicated  by  minimum  description  length,  mutual  information  calculations  performed 
with  a  uniform  partition  of  the  XY  plane  can  give  misleading  results.  This  motivated  the  construction  of 
an  algorithm  for  calculating  mutual  information  that  uses  an  adaptive  partition.  This  algorithm  also 
incorporates  an  explicit  test  of  the  statistical  independence  of  X  and  Y  in  a  calculation  that  returns  an 
assessment  of  the  corresponding  null  h3pothesis. 

The  previously  published  Fraser-Swinney  algorithm  for  calculating  mutual  information  is 
described.  This  algorithm  includes  a  sophisticated  procedure  for  local  adaptive  control  of  the  partitioning 
process.  When  the  Fraser  and  Swinney  algorithm  and  the  algorithm  constructed  here  are  compared,  they 
give  very  similar  numerical  results.  Detailed  comparisons  are  possible  when  X  and  Y  are  correlated 
jointly  Gaussian  distributed  because  an  analytic  expression  for  I(X,Y)  can  be  derived  for  that  case.  Based 
on  these  tests,  three  conclusions  can  be  drawn.  First,  the  algorithm  constructed  here  has  an  advantage  over 
the  Fraser-Swinney  algorithm  in  providing  an  explicit  calculation  of  the  probability  of  the  null  hypothesis 
that  X  and  Y  are  independent.  Second,  the  Fraser-Swinney  algorithm  is  the  more  accurate  of  the  two 
algorithms  when  large  data  sets  are  used.  With  smaller  data  sets,  the  Fraser-Swinney  algorithm  reports 
structures  that  disappear  when  more  data  are  available.  Third,  the  algorithm  constructed  here  requires 
about  0.5%  of  the  computation  time  required  by  the  Fraser-Swinney  algorithm. 
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I.  Introduction 


Given  two  time  series  {X}  =  {xi,x2, . {Y}  =  {yi,y2> . yNo)*  their  mutual 

information,  I(X,Y),  is  the  average  number  of  bits  of  {X}  that  can  be  predicted  by  measuring  {Y}.  It  can 
be  shown  that  this  relationship  is  symmetrical,  I(X,Y)=I(Y,X).  A  mathematical  definition  of  mutual 
information  and  a  demonstration  of  this  property  is  given  in  the  first  appendix.  This  appendix  includes  a 
summary  of  the  principal  mathematical  properties  of  I(X,Y).  A  more  systematic  presentation  is  given  in 
Cover  and  Thomas  (1991).  In  the  analysis  of  observational  data,  calculation  of  mutual  information  occurs 
in  three  contexts,  identification  of  nonlinear  correlation,  determination  of  an  optimal  sampling  interval 
particularly  when  embedding  time  series  data,  and  in  the  investigation  of  causal  relationships  with 
directed  mutual  information. 

Mutual  information  can  be  used  to  identify  and  quantitatively  characterize  relationships  between 
data  sets  that  are  not  detected  by  commonly  used  linear  measures  of  correlation.  Figure  1  recapitulates  an 
example  shown  in  Mars  and  Lopes  da  Silva  (1987)  and  displays  three  data  set  pairs.  The  first  shows  Xj 
when  Xj  =-3  to  +3  in  steps  of  0.0006  plotted  against  8; ,  a  random  normally  distributed  variable  with  zero 
mean  and  unit  variance.  The  second  element  of  Figure  1  shows  X;  versus  Xj  +  .2ej  where  Sj  is  the 

previously  used  random  variable.  In  the  third  example  of  Figure  1,  y;  =  xf  +  .26; .  Four  measures  were 
calculated  with  ten  thousand  element  data  sets.  The  first  was  the  linear  correlation  coefficient  r  (Press,  et 
al.,  1992).  The  probability  of  the  null  hypothesis  of  zero  linear  correlation  also  was  calculated.  A  small 
value  of  P^uii  indicates  a  high  degree  of  linear  correlation.  The  Spearman  rank  order  correlation  r3  and 
the  probability  of  the  corresponding  null  hypothesis  of  non-correlation  was  calculated.  If  Pj,^ui,  is  small 
and  rg  is  positive,  a  positive  correlation  has  been  detected.  If  Pj,(yj|  is  small  and  rg  is  negative,  anti¬ 
correlation  has  been  detected.  Kendall’s  tau,  a  nonparametric  measure  of  correlation,  and  its  associated 
^Nuii  calculated.  This  set  of  calculations  also  incorporated  estimation  of  mutual  information 

between  {X}  and  (Y}  using  an  algorithm  that  will  be  described  in  a  subsequent  section.  That  section  also 
includes  a  description  of  the  procedure  used  to  calculate  the  probability  of  the  null  hypothesis  of  statistical 
independence. 
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X 

Figure  1.  Data  sets  used  in  the  correlation  study  of  Table  1.  In  each  case,  x  varies  from  -3 

to  +3  in  steps  of  .0006.  A.  y;  =  6; ,  a  normally  distributed  random  variable  with  zero 

mean  and  unit  variance.  B.  yj  =Xj  +  .28;  C.  y;  =Xi  +.28, 

The  results  are  shown  in  Table  1 .  In  the  case  of  normally  distributed  random  numbers,  all  four 
measures  behave  in  a  manner  that  is  consistent  with  our  qualitative  understanding  of  the  word  correlation. 
Measures  r,  rg ,  x ,  and  I(X,Y)  are  small  and  the  probability  of  the  null  hypothesis  of  zero  correlation  or, 
in  the  case  of  mutual  information,  statistical  independence  is  high.  Similarly,  in  the  case  of  calculations 
with  linearly  correlated  noise  the  results  are  consistent  with  expectations.  The  correlation  measures  are 
very  nearly  equal  to  one  and  the  value  of  mutual  information  is  high.  The  corresponding  probability  of  the 
null  hypothesis  is  numerically  indistinguishable  from  zero  in  each  case. 

The  results  obtained  in  the  case  of  parabolic  correlation  merit  closer  inspection.  The  first  three 
measures  r,  rg ,  and  x  are  small  and  the  corresponding  values  are  high,  which  indicates  that  no 
correlation  was  detected.  In  contrast,  the  value  of  mutual  information  is  high,  essentially  equal  to  that 
obtained  using  linearly  correlated  data,  and  the  probability  of  the  null  h)q)othesis  of  statistical 
independence  is  zero.  Upon  reflection  it  is  seen  that  this  is  as  it  should  be.  Mutual  information,  I(X,Y),  is 
the  average  number  of  bits  of  y  that  can  be  predicted  by  measuring  x.  Though  the  relationship  between 
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{X}  and  {Y}  in  the  third  example  is  not  linear,  the  relationship  does  confer  a  significant  predictive 
capacity.  This  is  reflected  in  the  high  value  of  I(X,Y)  and  in  the  low  value  of  . 


Table  1 .  Correlation  Analysis 


Pearson 

r 

Pearson 

PnuII 

Spearman 

rs 

Spearman 

PnuII 

Kendall’s 

Tau 

Kendall’s 

PnuII 

I(X,Y) 

I(X,Y) 

PnuH 

Normally 

Distributed 

Random 

-.0037 

.7112 

-.0040 

.6854 

.0027 

.6845 

.1356 

.7851 

Linearly 

Correlated 

.9934 

0. 

.9936 

0. 

.9270 

0. 

2.9186 

0. 

Parabolically 

Correlated 

.0001 

.9912 

<10-4 

.9928 

<10“^ 

.9989 

3.0304 

0. 

Mutual  information  is  thus  seen  to  be  a  nonlinear  generalization  of  the  concept  of  linear 
correlation.  Beginning  with  the  pioneering  work  of  Callaway  (Callaway  and  Harris,  1974)  and  Mars  and 
Lopes  da  Silva  (Mars,  et  al.,  1985,  1987),  mutual  information  has  been  used  in  studies  of  nonlinear 
correlations  in  multichannel  EEGs.  The  investigations  indicate  that  mutual  information  estimates  can  be 
used  to  discriminate  between  focal  and  generalized  seizures.  Additionally,  in  the  case  of  focal  seizures, 
the  method  can  be  used  to  identify  the  location  of  the  epileptogenic  focus  (Mars,  et  al.,  1985). 
Generalizations  of  the  procedure  in  the  form  of  directed  mutual  information  will  be  considered  presently. 

Mutual  information  estimates  also  can  be  used  to  determine  an  appropriate  sampling  interval,  Tg , 
which  is  the  time  between  consecutive  measurements  of  a  time  series.  Many  of  the  calculations  presented 
here  will  be  calculations  directed  to  this  question.  The  selection  of  an  appropriate  sampling  interval  is  an 
important  consideration  when  the  quantitative  methods  of  dynamical  analysis  are  applied  to  time  series 
data.  On  first  consideration,  one  might  suppose  that  the  smallest  possible  Tg  would  be  the  best  option. 
While  this  may  be  a  reasonable  approach  during  data  acquisition,  this  strategy  can  fail  during  analysis 
because  calculations  with  over-sampled  data  can  produce  misleading  results  (Rapp,  et  al.,  1993). 
Historically,  calculation  of  the  autocorrelation  time,  the  time  required  for  the  autocorrelation  function  to 
drop  to  1/e  of  its  initial  value,  has  been  used  to  establish  an  approximate  sense  of  the  time  scale 
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corresponding  to  significant  changes  in  a  time  series’  behavior.  However,  as  we  have  seen  in  the 
preceding  calculations,  linear  measures  can  give  an  incomplete  characterization  of  behavior.  This 
recognition  has  motivated  the  calculation  of  lagged  mutual  information. 

Let  {X}  be  the  original  time  series,  and  let  time  series  {Y}  be  the  same  time  series  shifted  by 
time  a  lag,  that  is,  yj  =  Xj+L^g  .  The  mutual  information  I(Xi,Xi+Lag)  is  then  calculated.  This  process  is 

repeated,  and  I(Xj,Xj^.Lag)  is  determined  as  a  function  of  Lag.  In  order  to  get  the  most  new  information 

from  a  measurement,  we  want  to  take  the  next  measurement  when  there  is  maximum  uncertainty  in  the 
relationship  between  {X}  and  {Y}.  Recall  that  I(X,Y),  which  is  symmetrical  I(X,Y)=I(Y,X),  is  the 
average  number  of  bits  of  Y  that  can  be  predicted  by  measuring  X  and  vice  versa.  Therefore,  the 
maximum  uncertainty  in  the  relationship  between  {X}  and  {Y}  will  occur  at  a  minimum  of 
I(Xj,Xj+Lag)  •  This  indicates  that  Tg  should  be  set  equal  to  a  value  of  Lag  that  gives  a  minimum  of 

I(Xj,Xj+Lag)  ■  It  can  be  further  argued,  for  example  see  Fraser  and  Swinney  (1986),  that  among  the  many 
different  minima  of  I(Xj,Xi+Lag) >  the  sampling  interval  should  correspond  to  the  first  minimum  of 
I(Xj,Xj+Lag)  •  This  is  particularly  true  when,  as  is  often  the  case,  chaotic  systems  are  being  investigated 
since  the  turbulent  mixing  of  a  chaotic  system  will  cause  an  unacceptable  loss  of  structure  if  Tg  is  too 
large. 


Estimation  of  mutual  information  is  often  required  when  embedding  dynamical  data.  In  the 
simplest  case,  an  analysis  based  on  embedded  data  begins  with  a  scalar  time  series  {X} .  The  elements  of 

{X}  are  then  used  to  form  an  m-dimensional  set  {Z}  e  fR"’  with  the  construction 

~  j  >  ^  j+Lag  >  ^  j+2Lag’ . ^  j+(m-l)Lag) 

The  analysis  then  continues  with  the  investigation  of  the  geometrical  properties  of  {Z} .  The  motivation 
for  this  approach  follows  from  the  Takens-Mane  embedding  theorem  (Takens,  1980;  Mane,  1980),  which 
shows  that  if  the  conditions  of  the  theorem  are  met,  then  an  intimate  relationship  exists  between  {Z}  and 
the  dynamical  system  which  generated  the  observed  times  series  {X}.  This  theorem  requires  the 
assumption  that  set  (Z)  is  dense.  This  can  never  be  satisfied  with  finite  data  sets.  However,  while 
recognizing  the  approximate  nature  of  the  analysis,  an  investigation  of  (Z)  can  in  some  instances  provide 
significant  information  about  the  underlying  generator.  A  crucial  operation  difficulty  is  encountered  when 
embedding  finite  observational  data  sets.  Embedding  parameters  m  and  Lag  must  be  chosen.  This  choice 
is  crucial  to  the  success  of  the  subsequent  analysis.  Inappropriate  choices  of  m  and  Lag  can  result  in  the 
spurious  indication  of  structure  in  random  data  (Rapp,  et  al.,  1993).  Conversely  an  inappropriate 
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specification  can  result  in  the  unnecessary  failure  to  identify  structures  that  are  indeed  present  in  the  time 
series.  Several  candidate  criteria  for  selecting  m  and  Lag  have  been  proposed.  An  incomplete  review  of 
the  very  large  embedding  criterion  literature  is  given  in  Cellucci,  et  al  (2003).  Fraser  and  Swinney  (1986) 
proposed  that  the  best  value  of  Lag  to  use  in  an  embedding  is  given  by  the  first  minimum  of  the 
I(Xj,Xj^Lag)  versus  Lag  function.  This  proposal  is  supported  by  Abarbanel  (1995).  To  a  limited  degree 

the  Fraser-Swinney  proposal  was  confirmed  in  a  recent  comparative  study  of  embedding  criteria 
(Cellucci,  et  al.,  2003).  It  should  be  noted,  however,  that  this  comparison  was  limited  to  four  criteria  and 
is  therefore  not  definitive. 

Mutual  information  calculations  are  also  important  in  the  characterization  of  causal  relationships 
between  two  time  series.  Correlation  measures,  both  linear  and  nonlinear,  quantify  the  degree  of 
correlation  between  {X}  and  {Y}  under  their  respective  definitions,  but  they  do  not  identify  causal 
relationships  in  the  sense  of  identifying  which  variable  drives  the  other,  if  indeed  such  a  relationship 
exists.  The  quantification  of  causal  relationships  is  a  problem  that  is  frequently  encountered  in  the 
investigation  of  econometric  data.  Historically  the  most  commonly  employed  measure  of  causality  in 
economics  research  is  Granger  Causality  (Granger,  1969;  Kaminski,  et  al.,  2001)  which  is  based  on  the 
construction  of  bivariate  autoregressive  processes.  A  complementary  procedure  for  the  investigation  of 
causal  relationships  can  be  constructed  by  examining  delayed  mutual  information  functions.  Stated 
informally,  if  a  measurement  of  variable  x  can  predict  the  future  of  y  more  effectively  than  measurement 
of  y  can  predict  x,  then,  in  that  limited  sense,  in  an  isolated  system  variable  x  can  be  said  to  drive  variable 
y.  Depending  on  the  complexity  of  the  interacting  variables  being  investigated,  causal  relationships  can  be 
complex  functions  of  time.  I(Xi,Yj+^)  is  the  average  number  of  bits  of  y  at  time  t  +  x  that  can  be 
predicted  by  measuring  x  at  time  t.  Conversely,  I(Yi,Xi^^)  is  the  average  number  of  bits  of  x  at  time 
t  +  x  that  can  be  predicted  by  measuring  y  at  time  t.  Xu,  et  al  (1997)  describe  I(Xi ,  Yj^.^  )  as  the  rate  of 
information  transmission  from  variable  x  to  variable  y  at  a  delay  of  x  .  Several  investigators  have  used 
this  technique  to  assess  the  time  dependence  of  between  channel  information  transfer  in  multichannel 
EEGs  (Inouye,  et  al,  1983, 1993;  Lopes  da  Silva,  et  al.,  1989;  Xu,  et  al.,  1997;  Chen,  et  al.,  2000). 

II.  Calculating  I(X,Y)  with  a  uniform  partition  of  the  XY  plane 

Let  {X}  =  {x,,X2,X3 . and  {Y}  =  {yi,y2,y3 . yNo^  series  of  equal  length. 

Suppose  that  the  distributions  of  X  and  Y,  Px(i)  and  PyO)  are  approximated  by  histograms  of  Nx  and 


8 


Ny  elements  that  uniformly  divide  the  range  to  Xjj^^x  and  to  y^ax  •  Though  it  is  commonly 
the  case,  it  is  not  necessary  for  Nx  to  be  equal  to  Ny .  Indeed,  in  some  instances  Nx  =Ny  is 
inappropriate.  Suppose  that  signal  X  was  digitized  with  8  bits  and  that  signal  Y  was  digitized  with  16  bits. 
Should  this  be  the  case,  a  Y  histogram  with  a  greater  resolution  is  justified.  A  uniform  partition  from 
Xjnin  to  Xj„ax  ymin  to  ymax  makes  the  calculation  sensitive  to  outliers.  This  potentially  serious 
deficiency  and  the  choice  of  Nx  and  Ny  will  be  addressed  presently.  Let  OxyO,])  denote  the 
occupancy  of  the  (ij)-th  element  of  the  partition  of  the  XY  plane  that  extends  from  Xj^j^  to  Xj^^^  on  the 
X  axis  (Nx  equal  elements)  and  from  y^^jn  to  y^ax  o^i  the  Y  axis  (Ny  equal  elements).  Consider  the 

observed  pair  (xij,y]^),  k^^l, . N^.  OxY(i,j)  is  incremented  by  one  if  Xi^  is  in  the  i-th  partition 

element  of  the  X  axis  and  y,^  is  in  the  j-th  element  of  the  Y  axis  partition.  This  process  continues  for  all 
(Xk.Yk)  pairs.  PxyCiJ)  is  determined  by  normalizing  the  occupancy  against  the  number  of  paired 
observations;  PxY(i,j)  =  OxY(i,j)/N0  .  The  joint  probability  distribution,  PxY(i)j),has  NxNy  values, 
many  of  which  may  be  zero.  A  discrete  approximation  of  I(X,Y)  is  computed  using  the  relation  derived  in 
Appendix  1 . 

where  there  is  no  contribution  to  the  sum  if  PxyOj  j)  is  equal  to  zero. 

While  easy  to  implement,  this  procedure  for  estimating  mutual  information  contains  a  serious 
deficiency.  The  calculation  will  be  sensitive  to  the  choice  of  Nx  and  Ny  .  An  example  is  shown  in  the 
next  diagram.  I(Xj  ,Xj^.Lag)  is  plotted  as  a  function  of  Lag,  for  data  generated  by  the  Lorenz  system, 
dx  /  dt  =  ct(x  -  y) 
dy/dt  =  -xz  +  rx  -y 
dz/dt  =  xy  -bz 

where  a  =  10 ,  b=8/3  and  r=28.  Ten  thousand  values  of  the  x  variable  of  the  Lorenz  system  were  used  in 
calculations  where  Nx  =  Ny  =  N^ign^ents  equally  sized  elements  partition  each  axis.  In  these 
calculations,  a  well  characterized  minimum  of  I(Xj,Xj+Lag)  appears  at  Lag=18  when  NEiements  =50. 
However,  as  the  diagram  indicates,  this  minimum  is  lost  if  other  values  of  N Elements  used.  Since  the 

location  of  the  first  minimum  of  the  I(Xj,Xj+Lag)  versus  Lag  is  frequently  the  object  of  a  mutual 
information  calculation,  this  result  argues  against  the  common  practice  of  selecting  Nx  and  Ny 
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arbitrarily.  A  ratioiial  basis  for  selecting  can  be  constructed  using  a  minimum  description  length 

argument. 


Lag 


Figure  2.  I(Xj,Xi+Lag)  as  a  function  of  lag.  Ten  thousand  consecutive  values  of  the 
Lorenz  x  variable  were  used.  In  the  case  of  the  top  curve,  Npipm^nts  =50.  The  value  of 
^Elements  decrease  in  steps  of  ten  to  the  lower  curve  where  NEjgjjjents  ~  . 
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The  preceding  example  indicates  that  the  value  of  mutual  information  can  be  sensitive  to  the 
number  of  elements  used  when  a  uniform  partition  of  the  XY  plane  is  implemented.  We  must  therefore 
address  the  question  what  is  the  optimal  number  of  elements?  This  is  a  restatement  of  the  histogram 
problem  in  the  specific  context  of  mutual  information  calculations.  The  histogram  problem  is:  given  a 

scalar  data  set  X  =  {xj,X2, . x^} ,  how  many  elements  should  be  used  to  construct  a  histogram  of  X? 

If  there  are  too  many  elements,  each  element  has  an  occupancy  of  0  or  1  and  fails  to  identify  the 
distribution  of  X  in  a  meaningful  way.  Similarly,  if  there  are  only  a  small  number  of  elements  (consider 
the  limiting  case  of  a  single  element),  the  structure  of  the  distribution  cannot  be  discerned.  A  successful 
answer  therefore  lies  at  an  intermediate  value.  The  histogram  problem  has  a  long  history  and  has  been 
examined  by  several  investigators  (Bendat  and  Piersol,  1966  page  284;  Cocatre-Zilgien  and  Delcomyn, 
1992;  Mosteller  and  Tukey,  1977  page  49). 

1/9 

Tukey  suggest  that  n  ,  where  n  is  the  number  of  observations,  is  the  best  choice.  Bendat  and 
Piersol  (1966)  recommended  1.87(n  -  I)®  "* .  A  systematic  theoretical  development  of  the  question  is  given 
by  Rissanen  (1989,  page  76).  Rissanen  uses  a  minimum  description  length  argument  to  conclude  that  the 
optimal  value  of  the  number  of  elements  to  use  in  a  histogram  is  the  value  of  m,  m^pj ,  that  gives  a 

minimum  value  of  the  stochastic  complexity,  F(m). 

TR^  f  n  ^  fn  +  m-A 
F(m)=nlog2  —  +log2  +log2 

\mAJ  V  ^  J 

n  is  the  number  of  data  points  in  set  X.  R  is  the  range  of  X,  R  =  x^^^  “^min  •  ™  Ae  number  of 
elements  in  a  uniform  partition,  A  is  the  resolution  of  the  measurement  of  x,  and  ni,n2,---n^  are  the 
occupancies  of  each  element  in  the  partition.  The  multinomial  coefficient  is 

n  _  n! 

^ni,---,n„J  ni!n2!---n„! 

and  the  binomial  coefficient  is 

^n  +  m  -  A  _  (n  +  m-1)! 

^  n  j  n!(m-l)! 

The  value  of  A  only  shifts  the  function  by  an  additive  constant.  It  will  not  affect  the  value  of  m^pt .  If  the 
only  object  of  the  calculation  is  to  determine  m^pj ,  A  can  be  set  equal  to  1.  Base  two  logarithms  are  used 
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throughout  the  development  in  Rissanen,  but  again  if  the  sole  object  is  a  determination  of  m^pj ,  the 
choice  of  base  is  immaterial. 

F(M)  was  calculated  using  the  Lorenz  data  used  to  construct  Figure  2.  A  minimum  was  obtained 
at  m^pt  =  32  .  Using  this  value  for  the  number  of  elements  in  the  uniform  partition  of  the  X  and  Y  axes  in 

a  calculation  of  I(Xj,Xi^Lag)  gives  a  mutual  information  versus  lag  function  with  a  well  characterized 

first  minimum  at  Lag=21.  This  analysis  therefore  would  seem  to  provide  a  rational  procedure  for 
calculating  I(X,Y).  Application  to  the  Rossler  equations,  however,  shows  the  limitations  of  this  approach. 
The  Rossler  equations  used  in  the  next  calculations  were 
dx/dt=-y-z 
dy/dt=x+.2y 
dz/dt=.4+xz-5.7z 

Using  x-axis  data  generated  by  this  system,  a  calculation  of  the  Rissanen  F(M)  gives  a  minimum  at 
M=40.  A  forty-element  partition  of  each  axis  was  used  in  the  subsequent  calculations  of  mutual 
information  as  a  function  of  Lag  for  x,  y  and  z- variable  data.  The  resulting  mutual  information  versus  Lag 
functions  are  shown  in  Figure  3.  It  is  seen  that  while  x-axis  and  y-axis  data  give  functions  with  first 
minima  that  are  roughly  coincident,  the  function  obtained  with  z-axis  data  is  very  different. 
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Figure  3.  Mutual  information  I(Xj,X,+Lag)  ^  function  of  Lag  for  Rossler  data.  A 

uniform  partition  of  the  XY  plane  was  constructed  using  forty  elements  on  each  axis.  One 
hundred  thousand  data  points  were  used.  The  top  curve  was  obtained  with  variable  x.  The 
curve  immediately  below  it  was  constructed  with  variable  y  data.  The  lower  curve  was 
calculated  with  variable  z  data. 

The  cause  of  the  differences  in  the  z-variable  mutual  information  function  in  Figure  3  can  be 
identified  by  examining  a  three  dimensional  construction  of  the  trajectory  using  all  three  variables  (Figure 
4).  The  activity  of  the  Rossler  system  is  confined  predominantly  to  the  z  «  0  plane.  At  irregular,  chaotic 
intervals  there  is  an  abrupt  excursion  into  the  z>0  domain. 
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16 


Figure  4.  Three-dimensional  construction  of  the  Rossler  attractor  using  ten  thousand 
point  X,  y  and  z  vectors  generated  using  the  differential  equation  and  parameter  values 
specified  in  the  text. 


An  examination  of  the  histograms  formed  with  x,  y  and  z  data  (Figure  5)  shows  that  while  the  x  and  y 
values  are  approximately  uniformly  distributed,  most  of  the  activity  of  the  z  variable  is  confined  to 
[0,.375]  even  though  the  maximum  value  of  z  is  approximately  15. 
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Figure  5.  Histograms  constructed  with  Rossler  data.  The  histograms  were  formed  with 

the  ten  thousand  point  vectors  used  to  construct  the  three  dimensional  attractor  of  Figure 

4.  Note  that  the  ranges  of  the  vertical  axes  are  different. 

III.  Statistical  assessment  of  I(X,Y)  calculations 

The  results  with  Rossler  data  suggest  that  the  calculation  of  mutual  information  using  a  uniform 
partition  can  produce  misleading  conclusions.  An  alternative  to  uniform  partitioning  should,  therefore,  be 
sought.  An  additional  and  arguably  more  important  issue  should  also  be  addressed.  The  calculations  of 
mutual  information  should  be  constructed  on  a  sound  statistical  foundation.  When  computing  I(X,Y)  we 
should  incorporate  a  statistical  test  of  the  confidence  of  our  rejection  of  the  null  hypothesis  that  X  and  Y 
are  statistically  independent.  I(X,Y)=0  if  X  and  Y  are  statistically  independent.  In  practice,  we  wish  to 
know  if  a  computed  nonzero  value  of  I(X,Y)  is  statistically  significant.  Therefore,  given  time  series  X  and 
Y  of  length  Np ,  our  object  is  to  assess  the  null  hypothesis  that  X  and  Y  are  statistically  independent. 

The  null  hypothesis  of  statistical  independence  can  be  addressed  in  the  following  manner. 
Suppose  that  the  distributions  of  variables  X  and  Y  are  approximated  by  histograms  of  Nx  and  Ny 
elements.  In  most  applications  Nx  =  Ny,  but  this  is  not  required.  Ox(i)  is  the  observed  occupation 
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number  of  the  i-th  bin  of  the  variable  X  histogram.  Oy(j)  is  assigned  analogously.  OxyCbj)  is  the 
observed  occupation  number  (a  positive  integer,  not  a  fractional  probability)  of  element  i  j  of  the  XY 
partition.  There  are  NxNy  values  of  OxyCij)  •  Many  of  them  may  be  zero.  ExyO.  j)  is  the  expected 
occupancy  of  element  ij  of  the  XY  partition  given  the  assumption  that  X  and  Y  are  statistically 
independent. 


ExY(i.j)  =  NDPx(i)PY(j)  =  ND 


Ox(i)UOY(j)LOx(i)OY(j) 


Nr 


N 


D 


Nr 


where  it  should  be  noted  that  ExyO.  j)  is  not  necessarily  an  integer  and  is  the  number  of  x,  y  pairs. 


Following  conventional  statistical  practice  (Cochran,  1954;  Ott,  Longnecker  and  Ott,  1998),  we 
require  ExyCiJ)  ^  1  for  elements  of  the  partition  and  ExyCiJ)  ^  5  for  at  least  80%  of  these  elements. 
This  requirement  provides  a  rational  basis  for  selecting  the  number  of  histogram  bins  Nx  and  Ny  •  If  the 
condition  is  not  met,  Nx  and  Ny  should  be  decreased.  If  the  condition  is  satisfied  for  the  initially 
selected  values  of  Nx  and  Ny ,  then  they  can  be  increased  and  the  calculation  of  ExyCi,  j)  is  repeated. 
This  process  is  repeated  until  the  highest  values  of  Nx  and  Ny  consistent  with  the  constraints  on 
ExvCi,  j)  values  are  determined.  In  most  applications  Nx  =  Ny ,  and  this  calculation  is  straightforward. 


Once  the  final  values  of  Nx  and  Ny  are  determined  and  the  corresponding  values  of  OxyCi,  j) 

2 

and  ExyCi,))  are  calculated,  the  value  of  x  is  calculated. 

2  _^^{OxY(iJ)-ExY(i,j)}^ 
hVl  ExyCi, j) 


2  2 

The  condition  Exy  (i,  j)  ^  1  for  all  values  of  i,j  ensures  that  %  is  well  behaved.  In  addition  to  x  ,  v ,  the 
number  of  degrees  of  freedom,  is  also  computed. 

v  =  (Nx-1)(Ny-1) 


Using  and  v ,  the  probability  of  the  statistical  independence  null  hypothesis  is  computed. 

( 

Pnuu  =  probability  of  the  null  hypothesis  =  Q 


2  2 


Q  is  the  incomplete  gamma  function. 
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y  00  CO 

Q(x,y)  =  l--^  r(x)=  le-'t-ldt 

r(x)  J  r(x)  j  J 

IV.  Calculation  of  I(X,Y)  using  an  adaptive  XY  partition 

As  previously  outlined,  we  propose  that  calculation  of  mutual  information  should  be  statistically 

validated  by  application  of  a  y}  test  of  the  null  hypothesis  of  statistical  independence.  Additionally,  the 
partition  of  the  XY  plane,  which  is  used  to  calculate  the  joint  probability  distribution  PxY>  should  satisfy 
the  Cochran  (1954)  criterion  on  the  expectancies  Exy  Specifically,  we  require  ExyCbj)^!  for  all 
elements  of  the  partition  and  ExY(i.j)^5  for  at  least  80%  of  the  elements  of  the  partition.  In  the 

following  algorithm,  we  use  the  expectation  criterion  to  construct  a  nonuniform  XY  partition.  This 
procedure  has  two  advantages  over  the  use  of  a  naive  uniform  partition.  First,  it  reduces  sensitivity  to 
outlying  values  of  X  and  Y.  Second,  it  provides  an  approximation  of  the  highest  partition  resolution 
consistent  with  the  expectation  criterion. 

Let  denote  the  number  of  X,  Y  pairs.  is  the  number  of  elements  used  in  the  partition  of 
the  X  axis.  Ny  is  the  number  of  elements  used  to  partition  the  y  axis.  For  this  implementation  of  the 
algorithm,  Nx  and  Ny  are  equal  and  denoted  by  the  number  of  elements  .  We  stress  that  Ng  is  the 
number  of  elements  in  the  partition  of  an  axis.  It  is  not  the  number  of  elements  in  the  XY  plane,  which  is 
Ng .  The  specification  Ng  =  Nx  =  Ny  is  particularly  appropriate  when  data  set  Y  is  a  lagged  version  of 
data  set  X.  Ng  is  determined  by  the  following  procedure:  after  determining  Xjj,j„  and  x^^j^  >  l^e  x  axis  is 
partitioned  into  Ng  elements  so  that  there  is  an  equal  occupancy  in  each  element.  This  partition  is 
nonuniform  in  the  sense  that  the  widths  of  each  element  are  adjusted  individually  in  order  to  meet  the 
requirement  of  uniform  occupancy.  Let  Px(i)  denote  the  probability  of  X’s  membership  in  the  i-th 
element  of  the  x  axis  partition.  We  have 

Px(i)  =  l/NE 

Similarly,  after  determining  yj^^  and  y^^^^  ,  the  y-axis  is  partitioned  into  Ng  elements  so  that  there  is  an 
equal  number  of  occupants  in  each  y  axis  element. 

Py(j)=l/NE 
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Under  the  null  hypothesis  of  statistical  independence,  the  expected  occupancy  of  the  (io)-th 
element  of  the  partition  of  the  XY  plane  is 

ExY(iJ)  =  NDPx(i)PY(j)  =  ^ 

Ne 

Ng  is  determined  by  finding  the  largest  possible  value  that  gives  ExyCb  j)  ^  5  for  all  elements  of  the 
XY  partition.  This  criterion  is  therefore  more  conservative  than  the  Cochran  (1954)  criterion  that  requires 
Exy  to  be  greater  than  five  in  at  least  80%  of  the  elements.  Ng  is  the  greatest  integer  such  that 


Ne< 


\l/2 
D 

5  J 


PxY  (h  j)  is  calculated  using  this  partition.  Mutual  information  is  calculated  with  the  previously  derived 
formula. 


Ne  Ne  I 

I(X,Y)=:;2^PxY(i,j)log 

i=l  j=l  I 


PxvCbj)  1 
Px(i)PY(j)J 


2 

X  and  Pj^uii  are  calculated  as  previously  described.  If  Nq  is  exactly  divisible  by  Ng ,  then  the  formula 
for  mutual  information  simplifies  and  becomes 

Ne  Ne 

I(X,Y)  =  ^X!^’xY(i.j)log{N|PxY(i,j)} 

i=i  j=i 

However,  when  is  not  a  multiple  of  Ng,  elements  of  the  x  axis  and  y  axis  partitions  do  not  have 
exactly  identical  probabilities  equal  to  1/Ng ,  and  the  preceding  formula  should  be  used.  If  the  Cochran 
expectation  criterion  is  satisfied  (and  by  construction  it  will  be)  and  the  null  hypothesis  is  not  rejected, 
then,  to  the  extent  that  can  be  determined  by  calculations  with  this  algorithm,  the  two  data  sets  are 
statistically  independent.  Under  these  conditions,  reporting  a  nonzero  value  of  mutual  information  caimot 
be  justified.  Therefore,  in  cases  where  the  null  hypothesis  is  not  rejected,  the  algorithm  returns  I(X,Y)=0 
rather  than  the  numerical  value  produced  by  the  preceding  formula. 


The  application  of  this  procedure  to  the  Rossler  data  is  shown  in  Figure  6.  In  contrast  with  the 
results  of  Figure  3,  which  were  obtained  with  a  uniform  partition,  it  is  seen  that  the  first  minimum  of  the 
mutual  information  versus  lag  functions  obtained  with  x,  y  and  z-variable  data  approximately  coincide 
when  the  adaptive  partition  is  used.  The  probability  of  the  null  hypothesis  was  calculated  for  each  value 
of  Lag.  With  these  data,  Pj,|y|j  was  found  to  be  numerically  indistinguishable  from  zero  for  each  value  of 

Lag.  Since  the  data  set  Y  used  in  these  calculations  of  I(X,Y)  is  a  lagged  version  of  data  set  X,  this 
rejection  of  the  null  hypothesis  is  anticipated. 
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Figure  6.  Mutual  information  as  a  function  of  lag  using  Rossler  data.  The  data  used  in 
Figure  3  were  used  in  these  calculations.  N^ata  =  100000 .  Viewed  at  Lag=18,  the  curves 
from  the  X,  y  and  z  variables  have  the  top-down  order  of  x  to  z  to  y. 


In  the  algorithm  constructed  here,  the  number  of  elements  in  the  X  axis  partition  is  equal  to  the 
number  of  elements  in  the  Y-axis  partition.  This  number  is  denoted  by  Ng.  In  this  algorithm  Ng  is 
determined  by  the  Cochran  criterion.  Once  Ng  is  specified,  the  boundaries  of  the  partition’s  elements  are 
adjusted  so  that  each  X-axis  element  and  each  y  axis  element  have  the  same  occupancy.  P^y  (i,  j)  and 
I(X,Y)  are  calculated  using  this  partition.  Suppose  that  time  series  X  is  transformed  by  a  monotone 
increasing  function  h^  where  hjj  tnay  be  nonlinear.  Similarly  suppose  that  time  series  Y  is  transformed 
by  a  monotone  increasing  function  hy .  The  adaptive  partition  algorithm  for  calculating  mutual 
information  is  then  applied  to  calculate  I(hx(X),hy(Y)).  These  transforms  are  monotonic.  Therefore 
while  the  values  are  changed,  the  relative  ordering  of  elements  in  the  time  series  are  invariant.  When  the 
algorithm  is  applied,  the  location  of  the  boundaries  of  axis  partitions  will  be  shifted  by  the  occupancies  of 
each  element  will  be  unchanged,  that  is,  (i) ,  Py  (j)  ,  and  j)  are  unchanged.  Therefore  the  value 

of  mutual  information  is  unchanged.  This  is  summarized  in  the  following  result. 
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Theorem 

Let  X  and  Y  be  time  series  of  equal  length.  Let  hx  and  hy  be  monotone  increasing  functions.  If  mutual 
information  is  calculated  using  the  adaptive  partition  algorithm,  then 

I(X,Y)  =  I(hx(X),hY(Y)) 

Fraser  and  Swinney  (Fraser  and  Swinney,  1986;  Fraser,  1989)  have  constructed  an  alternative 
adaptive  partition  algorithm.  It  is  described  in  the  next  section. 


V.  The  Fraser-Swinney  Algorithm 


As  in  the  case  of  the  previous  algorithm,  the  calculation  is  directed  to  an  estimate  of  the  discrete 
form  of  the  mutual  information  integral. 


Nx  Ny 

I(X,Y)  =  J^PxY(i,j)log 


i=l  j=l 


^xyClj)  1 

Px(i)PY(j)J 


Numerical  approximation  of  the  joint  probability  distribution  PxY  constitutes  the  most  demanding 
element  of  the  computation.  The  Fraser-Swinney  algorithm  (Fraser  and  Swinney,  1986)  does  this  by 
constructing  a  locally  adaptive  partition  of  the  XY  plane. 


As  a  preliminary  exercise  leading  to  the  construction  of  the  algorithm,  consider  a  sequence  of 

partitions  Gq  ,  Gj ,  G2 , . ,  G^^ .  Each  partition  is  a  grid  of  4™  elements  generated  by  dividing  the 

X  and  Y  axis  into  2'"  equiprobable  elements,  that  is  the  boundaries  on  the  X  and  Y  axis  are  positioned  so 
that  Px  =  Py  =  1/ 2*"  for  each  element  of  the  partition.  Gq  is  the  entire  XY  plane.  R„(Km)  denotes  an 

element  of  the  partition  Gp, .  In  this  notation  index  K^p  runs  from  1  to  4™.  PxY(Rm(Kni))  is  the  value 
of  the  joint  probability  distribution  on  element  Rp,(Kp,) . 

PxY(Rm(Kn,))  =  N(R„(K„,))/No 

where  N(Rp,(Kp^))  is  the  occupancy  of  element  Rp,(Kp,)  and  Nq  is  the  total  number  of  XY  pairs. 
Using  this  notation  for  the  partition  and  the  equiprobability  of  the  X  and  Y  axis  partitions  gives  the 
following  expression  for  Ip,(X,  Y) ,  the  estimate  of  mutual  information  corresponding  to  partition  Gp, . 
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4^  r  " 


I„,(X,Y)  =  ml0g4+  J]PxY(Rm(K„,))l0gPxY(Rm(Km)) 
Km=I 

where  the  first  expression  for  Y)  was  simplified  using  the  relationships 

410 

;^Pxy(R,(K„))==1 


Px(R„(K,))  =  Py(R,(K„))-1/2'" 

The  essential  feature  of  the  Fraser-Swinney  algorithm  is  to  take  this  sequential  partitioning 
procedure  and  modify  it  to  produce  an  adaptive  partition  where  the  subdivision  of  any  given  element  is 
locally  determined  by  the  structure  of  the  joint  probability  distribution  P^y  on  that  element.  This  process 
can  be  depicted  by  the  tree  structure  shown  in  Figure  7.  In  this  notation,  Rg  is  the  XY  plane. 


R2(U)  R2(1,2)  R2(U)  R2(M) 


Figure  7.  Illustrative  example  of  the  adaptive  partition  employed  by  the  Fraser-Swinney 
algorithm.  In  this  hypothetical  example,  the  substructure  of  elements  Ri(2)  and  RfyS)  is 
approximately  uniform  and  these  elements  are,  therefore,  not  partitioned.  Elements 
Ri(l),  Ri(4)  and  R2(4,2)  are  partitioned  into  sub-elements  because  they  meet  the 
criterion  for  the  presence  of  smaller  scale  structure. 

The  notation  for  individual  elements  of  the  partition  is  revised  to  reflect  this  structure.  As  before, 
each  iteration  of  the  partition  is  effected  by  a  binary  equiprobable  division  of  the  X  and  Y  axes  of  an 
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element.  In  tree  notation,  an  individual  element  Rm  in  the  partition  G^,  is  identified  by  an  m-tuple 
Km=(ki,k2, . k„),R„(K„). 


A  finer  partition  is  used  in  areas  of  the  XY  plane  where  P^y  has  nonuniform  structure.  For  the 
hypothetical  example  in  the  diagram,  Pxyi^  deemed  to  be  approximately  uniform  on  Ri(2)  and  Ri(3) . 
The  partitioning  terminates  with  these  elements.  In  contrast,  Ri(l)  and  Ri(4)  have  locally  nonuniform 
joint  distributions  and  are  partitioned.  In  this  example,  partitioning  terminates  at  the  G2  level  with  the 
exception  of  element  R2(4,2) ,  which  has  a  nonuniform  joint  distribution  and  is  partitioned  into  four  G3 
elements,  R3  (4,2,1)  through  R3  (4,2,4).  As  previously  stated,  the  partitioning  continues  until  the  joint 
distribution  P^y  is  approximately  uniform.  A  justification  for  using  the  uniformity  of  P^y  as  the 
criterion  for  terminating  the  partitioning  process  can  be  established  by  examining  the  special  case  where 
Pxy  is  exactly  uniform  on  R^(Kn,) ,  where  =(kj,k2,"-kn^)  is  the  vector  that  identifies  an  element 

of  the  order-m  partition.  P^y  is  said  to  be  exactly  uniform  on  R^(K^)  if  P^y  values  on  the  subdivision 
Rn,+i(Kni>l)  to  Rm+i(K;m>4)  equal.  For  the  case  of  an  exactly  uniform  partition  on  R^(K„)  we 
have  the  following: 

Pxy(Rm+l(Kma))  =  PxY(Rm+l(Kn,>2))  =  PxY(Rni+l(Kn„3»=Pxy(R„,+l(K^,4)) 


Let  I^(R^(K^))  denote  the  contribution  of  element  R^(Kjj,)  to  mutual  information.  For  the  general 
case,  we  have 


I„(R,(K„))  =  Pxy(R„(K„))log 


PxY(Rm(Km)) 


Px(R,(K„))Py(R,(K„))J 
where  by  construction  Px(Rm(K„))  =  Py(R„(K^))  =  l/2'".  I„+,(R„(K^,j))  is  defined  analogously 
on  each  of  the  four  subdivisions  of  R^(K^).  Using  the  equiprobability  of  the  partition  gives 

Px(P^m+i(Km’ j))  ~  PY(P^m+i(Km>  j))  ~  .  The  definition  ofthe  joint  probability  distribution  gives 

PxY  (Rm+1  (K™ ,  j))  =  N(R,.,i  (K„  ,  j)  /  No 
where  by  construction  of  the  partitioning  process 

N(R„(K„))  =  XN(R„„(K„,j)) 

j=l 


These  relationships  are  generically  valid.  However,  for  the  special  case  where  Pxy  is  exactly  uniform  on 
Rn,  (K^)  it  can  additionally  be  shown  that 
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j=i 

Thus,  in  the  case  where  P^y  is  exactly  uniform  on  Rj^CK^^),  dividing  the  partition  element  into  four 
subdivisions  will  have  no  effect  on  the  contribution  to  mutual  information  obtained  from  that  element. 
Terminating  the  partitioning  process  at  level  is  therefore  justified  in  this  case.  The  qualifying  phrase 
“from  that  element”  is  crucial  to  our  understanding  of  the  algorithm.  If  the  uniformity  condition  were  not 
met,  the  equality  expressed  in  the  immediately  preceding  equation  would  not  be  obtained.  As  a  practical 
matter,  however,  it  is  necessary  to  establish  a  criterion  that  can  be  used  to  terminate  the  partitioning 
process  for  some  specific  element  Rn,(Kjr,)  when  P^y  is  nearly,  but  not  exactly,  uniform  on  that 

element.  In  their  paper,  Fraser  and  Swiimey  (1986)  construct  a  test  for  uniformity  that  uses  a  test  to 
examine  structure  on  both  the  m+1  and  m+2  generation  partition  of  R^(Kfn).  Let  N  =  N(R^(Kjj,)) 
denote  the  number  of  XY  pairs  in  element  Rn^(K^) .  Using  analogous  notation  for  the  subdivisions,  let 
Ej  =  N(Rn,_^i(K^,i))  and  let  bjj  =  N(Rm+2(K:m’hj))  •  Fraser  and  Swinney  criterion,  P^y  will  be 

deemed  to  be  effectively  uniform  on  Rn^(K^)  and  the  partitioning  process  will  be  terminated  on  that 
element  if  both  X3  <1.547  and  xfs  <  1.287  ,  where 


It  should  be  noted  that  while  the  Fraser-Swinney  algorithm  uses  a  %  criterion  to  control 

subdivisions  of  the  XY  plane  locally,  it  does  not,  in  contrast  that  the  algorithm  of  the  previous  section, 
provide  a  global  statistical  assessment  of  an  I(X,Y)  calculation  which  includes  the  probability  of  the  null 
hypothesis  of  statistical  independence.  The  code  implementing  their  algorithm  distributed  by  Fraser  and 
Swinney  departs  from  the  partition  termination  criterion  outlined  in  the  text  of  their  paper.  In  their  code, 
the  probe  for  structure  is  conducted  at  only  one  sublevel  and  the  partitioning  process  is  terminated  if 

X3  <1-547. 

Results  obtained  when  our  implementation  of  the  Fraser-Swinney  algorithm  with  a  single-level 
partition  termination  criterion  of  X3  <1-547  was  applied  to  the  Rossler  data  of  Figure  3  are  shown  in 
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Figure  8.  In  our  implementation,  as  in  the  case  of  the  Fraser-Swinney  code,  the  length  of  data  sets  X  and 
Y  must  be  a  power  of  two.  Visual  comparison  of  the  results  obtained  with  the  Fraser-Swinney  algorithm 
and  Npata  ~  65,536  (Figure  8)  with  the  results  obtained  with  the  algorithm  of  Section  IV.  and 
Noata  =  100,000  suggest  that  similar  results  were  obtained.  This  point  is  emphasized  in  Figure  9  which 
shows  that  superposition  of  the  results  obtained  when  N^ata  =  65,536  for  both  algorithms.  The  values  of 
lag  corresponding  to  the  first  minimum  of  the  mutual  information  versus  lag  function  obtained  with  the 
two  algorithms  are  either  equal  or  differ  by  one. 


Figure  8.  Mutual  information  as  a  function  of  lag  using  the  Rossler  data  of  Figure  3 
Mutual  information  was  calculated  using  the  Fraser-Swinney  algorithm  when 
Nd  =  65,536 .  Viewed  at  Lag=18,  the  curves  from  the  x,  y  and  z  variables  have  the  top- 
doAvn  order  of  x  to  z  to  y. 
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Rossler  X  Data 


Figure  9.  Direct  comparison  of  results  obtained  with  the  algorithm  of  Section  IV.  and  the 
Fraser-Swiimey  algorithm  using  Rossler  data  of  Figure  3.  =  65,536 .  For  those  values 

of  lag  where  the  results  of  the  two  algorithms  differ,  the  results  of  the  algorithm  of 
Section  IV.  are  below  the  results  obtained  with  the  Fraser-Swinney  algorithm 

We  now  have  two  candidate  procedures  for  calculating  I(X,Y),  the  Fraser-Swiimey  algorithm  and 
the  adaptive  partition  algorithm  presented  in  Section  IV.  A  procedure  for  comparing  the  two  methods  is 
constructed  in  the  next  section. 


VI.  Comparing  algorithms 

In  the  previous  sections,  two  procedures  for  computing  mutual  information  were  presented.  They 
are  compared  in  this  section.  Two  properties,  accuracy  and  speed,  are  examined.  A  comparison  of 
accuracy  requires  example  cases  where  the  true  value  of  mutual  information  is  known  to  a  high  accuracy. 
This  can  be  provided  by  jointly  Gaussian  data  sets.  Two  data  sets  are  said  to  be  jointly  Gaussian  if  their 
joint  probability  density  fiinction  centered  at  (m^  ,my)  has  the  form 
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PxY(x,y)  = 


1 

- TTTrexp 

27caxay(l-r^)*^^ 


-1 


2(l-rO 


x-m^ 


f 

-2r 


V  y 


x-m^ 


‘^X  / 


V y-niy  ^ 
V  j 


^2 


y-m, 

\  j 


nijj  and  are  the  mean  and  standard  deviation  of  time  series  {X}.  my  and  Oy  are  defined  analogously 

for  (Y},  and  r  is  the  cross-correlation  coefficient  between  {X}  and  {Y}.  For  the  case  of  jointly  Gaussian 
data  sets,  the  mutual  information  is  analytically  related  to  the  correlation  coefficient  by 

I(X,Y)  =  -0.51og(l-r2) 

(Mars  and  Lopes  da  Silva,  1987).  A  derivation  of  the  relationship  is  given  in  Appendix  2.  The 
construction  of  a  procedure  for  generating  jointly  Gaussian  data  sets  with  a  specified  correlation 
coefficient  is  also  presented  in  that  appendix. 

Mutual  information  estimates  obtained  with  the  algorithm  of  Section  IV  and  with  the  Fraser- 
Swirmey  algorithm  are  compared  against  -.51og(l  -r^)  for  the  case  of  jointly  distributed  Gaussian  data 
in  Figure  9.  Ninety  nine  values  of  r,  uniformly  distributed  on  (-1,1)  were  used  in  these  calculations.  For 
each  value  of  r,  one  hundred  jointly  distributed  {X},  {Y}  data  set  pairs  of  length  8,192  were  generated. 
The  average  value  of  mutual  information  for  these  pairs  was  determined  using  both  algorithms.  Multiple 
variants  of  each  algorithm  were  used.  The  irregular  I(X,Y)  versus  r  function  seen  in  Figure  9  was 
produced  using  the  Fraser-Swinney  algorithm  when  the  sub-partitioning  process  was  terminated  with  the 

criterion  Xa  <1.547 .  With  this  criterion,  an  element  of  the  partition  is  subdivided  if  the  probability  of 
nonuniform  substructure  is  greater  than  27%.  This  is  the  criterion  implemented  in  their  code.  Calculations 
were  also  performed  using  Xa  <5.000.  This  criterion  results  in  the  subdivision  of  an  element  of  the 
partition  only  if  the  probability  of  nonuniform  substructure  is  at  least  80%.  In  this  case,  the  results  were 
much  closer  to  -0.51og(l  -r^) .  Three  variants  of  the  algorithm  constructed  in  Section  IV  were  used.  In 
the  first  instance,  the  number  of  elements  in  the  partition  were  chosen  so  that  ExY(iJ)^5  for  all 
elements.  Recall  that  ExyOj)  is  the  expected  occupancy  in  partition  element  (ij)  given  the  null 
hypothesis  of  statistical  independence;  ExY(iJ)  =  NDPx(i)PY(j)  where  is  the  number  of  elements 
in  the  time  series  {X}  and  {Y}.  Calculations  also  were  performed  with  the  Section  IV  algorithm  with 
^XY(i>j)-iO  and  with  ExY(i)j)^15.  In  the  case  of  the  Section  IV  algorithm,  the  value  I(X,Y)=0  is 
returned  whenever  the  null  hypothesis  of  statistical  independence  is  not  rejected  with  a  confidence  level 
of  at  least  95%.  This  convention  accounts  for  the  transition  to  I(X,Y)=0  in  the  vicinity  of  r=0  for  I(X,Y) 
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functions  obtained  with  this  algorithm.  Viewed  at  r=.2  the  top-down  ordering  of  the  I(X,Y)  versus  r 
functions  is  (i.)  Fraser-Swinney  algorithm  with  %3  <1.547,  (ii.)  the  algorithm  of  Section  IV  with 
E^y  (i,  j)  ^  5 ,  (iii.)  the  algorithm  of  Section  IV  with  E^y  (i>  j)  ^  10  >  (iv.)  the  algorithm  of  Section  IV  with 
ExY(i>j)^15,  (v.)  the  Fraser-Swiimey  algorithm  with  xf  <5.000,  (vi.)  the  analytical  solution 
-  0.5  log(l  -  r  ) .  The  greatest  numerical  value  of  I(X,Y)  is  obtained  with  the  Fraser-Swinney  algorithm 

with  a  subdivision  criterion  of  <1.547  which  results  in  subdivisions  whenever  the  probability  of 
nonuniform  substructure  exceeds  27%.  This  produces  the  greatest  value  of  I(X,Y)  because  the 
comparatively  tolerant  criterion  of  27%  introduces  a  numerical  indication  of  small  scale  structure  in  the 
data  (and  hence  a  greater  value  of  mutual  information)  that  may  not  be  present.  With  the  more  demanding 

criterion  of  y}  <5.000,  a  subdivision  is  introduced  only  if  there  is  at  least  an  80%  probability  of 
nonuniform  substructure.  With  this  criterion  there  is  less  divergence  between  the  algorithm-estimated 
value  of  mutual  information  and  the  analytically  computed  value  of  -  0.5  log(l  -  r  ^ ) . 
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Figure  10.  Comparing  the  Fraser-Swinney  algorithm,  the  algorithm  of  Section  IV.  and 
-.51og(l-r^)  for  jointly  distributed  Gaussian  data.  Ninety-nine  values  of  correlation  r 
uniformly  distributed  on  (-1,1)  were  used.  =  8,192 .  For  each  value  of  r,  one  hundred 
{X},  {Y}  data  set  pairs  were  generated.  The  algorithm’s  average  value  of  mutual 
information  is  displayed.  Viewed  at  r=.2  the  top-down  ordering  of  the  I(X,Y)  versus  r 

functions  is  (i.)  Fraser-Swinney  algorithm  with  Xs  <1.547  ,  (ii.)  the  algorithm  of  Section 
IV.  with  ExY(i>j)^5,  (hi.)  the  algorithm  of  Section  IV.  with  ExY(i)j)^10,  (iv.)  the 
algorithm  of  Section  IV.  with  ExY(i)j)^15,  (v.)  the  Fraser-Swinney  algorithm  with 

xf  <  5.000 ,  (vi.)  the  analytical  solution  -  0.5  log(l  -  r^ ) 


Following  Hamilton  (1964),  the  following  error  measure  was  calculated. 


ERROR = 


99  /  \7 

y^(l(X  —  I(X  j 


i=l 


99  ( 

£(i(x,Y)^^y‘‘'="‘) 

i=l 


where  I(X,  denotes  the  value  obtained  using  -  .5  log(l  -  r  ^ ) .  The  results  are  shown  in  the  next 

table.  It  is  seen  that  the  magnitude  of  the  error  is  low  with  both  algorithms. 
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Table  2.  Average  Normalized  Error  in  the  Estimation  of  Mutual  Information 


Algorithm 

ERROR 

Algorithm  of  Section  IV 
^XY  (i>  j)  -  ^ 

1.91x10"^ 

Algorithm  of  Section  IV 
^XY  (h  j)  ^10 

1.55x10“^ 

Algorithm  of  Section  IV 
^XY(b  j)-15 

3.15x10“^ 

Fraser-Swinney  Algorithm 

X3  <1-547 

2.48x10"' 

Fraser-Swinney  Algorithm 

X3  <5.000 

0.97x10“^ 

In  addition  to  providing  an  explicit  assessment  of  the  probability  of  the  null  hypothesis  of 
statistical  independence,  the  algorithm  of  Section  IV  offers  an  additional  advantage  over  the  Fraser- 
Swinney  algorithm;  it  is  much  faster.  Comparison  of  computation  times  with  data  sets  of  different  lengths 
is  given  in  Table  3.  Both  programs  were  run  in  Matlab  6.5.0  (R13)  on  a  Pentium  4  processor  running  at 
2.53  GHz.  The  computation  times  of  the  algorithm  of  Section  IV  are  typically  on  the  order  of  .5%  of  the 

times  required  by  the  Fraser-Swinney  Algorithm.  In  addition  to  being  more  accurate  than  the  Xs  <1.547 

criterion,  the  Xs  <  5.000  algorithm  is  faster  because  it  introduces  fewer  subdivisions. 


Table  3.  Comparative  Computation  Times  for  Different  Algorithms 


^Data 

Time 

Algorithm  of  Section  IV 
(seconds) 

Time 

Fraser-Swinney  Algorithm 
X3  =1.547  (seconds) 

Time 

Fraser-Swiimey  Algorithm 
X3  =5.00  (seconds) 

4096 

1.3 

266.2 

185.2 

8192 

2.7 

544.0 

392.4 

16384 

5.0 

1169.5 

851.0 

32768 

9.3 

2549.5 

1898.5 

65536 

24.1 

5940.5 

4533.5 

An  approximate  understanding  of  the  sensitivity  of  the  two  algorithms  to  data  set  size  can  be 
obtained  by  examining  the  results  presented  in  Figure  1 1 .  That  diagram  shows  the  mutual  information 
versus  lag  functions  obtained  from  a  single  data  set  generated  by  the  Rossler  equations  (x  variable  data). 
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As  already  seen  in  Figure  8,  the  results  obtained  when  Nq  =65536  are  almost  identical.  More 
substantive  differences  are  observed,  however,  when  smaller  data  sets  are  used.  When  is  4096  and 
8192,  the  algorithm  of  Section  IV  produces  output  that  is  slightly  less  than,  but  largely  parallel  to,  the 
results  obtained  when  Njj  =65536 .  For  this  algorithm,  the  value  of  Lag  giving  the  first  minimum  of 
mutual  information  was  the  same  for  all  values  of  tested.  In  contrast,  when  Nq  =4096  and  8192, 

the  Fraser-Swinney  algorithm  produces  mutual  information  versus  lag  functions  that  present  structures 
that  are  lost  when  more  data  are  incorporated  into  the  computations.  In  some  instances,  these  structures 
can  alter  the  identification  of  the  lag  giving  the  minimum  value  of  mutual  information. 
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Fraser-SwinneyAlgonlhmx  =1.547 


80  100  120 


Fraser-Swinney  Algorithm  x  =5.000 


Figure  11.  Mutual  Information  versus  Lag  for  Data  Sets  of  Different  Sizes.  Mutual 
information  versus  lag  was  computed  using  both  algorithms  for  Nq  =4096,  8192, 
16384,  32768  and  65536.  The  data  were  generated  by  the  Rossler  equations,  and  x- 
variable  output  was  used  in  the  calculations.  Functions  calculated  with  Np  =  65536  are 
at  the  top  of  each  set  of  curves.  Functions  calculated  with  =  4096  are  at  the  bottom 
of  each  set  of  curves. 


VII.  Discussion 


The  Fraser-Swiimey  algorithm  with  the  Xa  <5.000  criterion  out-performs  that  algorithm  when 
Xa  <  1.547  is  used  both  in  terms  of  accuracy  (Table  2)  and  speed  (Table  3).  Given  a  dichotomous  choice 
between  these  two  options,  the  Xa  <5.000  variant  would  be  preferable.  A  comparison  of  the  Fraser- 

Swinney  algorithm  with  the  Xa  <5.000  criterion  against  the  algorithm  of  Section  fV  leads  to  the 
following  conclusions.  First,  the  algorithm  of  Section  IV  has  a  significant  advantage  over  the  Fraser- 
Swinney  algorithm  in  providing  a  global  test  of  the  statistical  independence  null  hypothesis.  The  Fraser- 

Swinney  algorithm  uses  a  x^  test  locally  to  implement  the  partitioning  protocol.  It  does  not,  however, 
return  an  assessment  of  the  statistical  independence  of  X  and  Y.  Second,  while  the  Fraser-Swinney 
algorithm  is  more  accurate  with  data  sets  where  =8192  (Table  2),  the  results  of  Figure  11  suggest 

that  the  Fraser-Swinney  algorithm  requires  large  data  sets  even  when  the  X3  <5.000  criterion  is  used. 
When  smaller  data  sets  are  used  the  Fraser-Swinney  algorithm  presents  structures  that  disappear  when 
more  data  becomes  available.  If  the  object  of  the  calculation  is  to  use  I(Xj,Xj^Lag)  functions  to  find  the 

appropriate  Lag  for  embedding,  then  these  local  minima  could  give  misleading  results.  Third,  the 
algorithm  of  Section  IV  requires  about  .5%  of  the  calculation  time  required  by  the  Fraser-Swinney 
algorithm. 

Limitations  of  this  study  should  be  noted.  Additional  algorithms  could  be  considered.  Following 
Silverman  (1986),  Moon,  et  al.  (1995)  have  used  kernel  density  estimators  to  calculate  probability 
densities.  They  argue  that  the  resulting  algorithm  outperforms  the  Fraser-Swinney  algorithm.  Moon,  et  al. 
also  suggest  that  their  algorithm  can  be  improved  by  using  K-d  trees  to  partition  the  data.  Caution  must  be 
exercised  when  evaluating  this  suggestion.  Our  exploratory  calculations  have  shown  that  K-d  tree 
partitions  can  be  very  sensitive  to  initial  conditions.  This  sensitivity  is  addressed  by  Bradley  and  Fayyad 
(1998)  who  published  a  procedure  for  computing  initial  conditions  based  on  a  procedure  for  estimating 
the  modes  of  a  distribution. 

Instead  of  partitioning  phase  space  as  is  done  in  the  algorithms  discussed  above,  Pawelzik  and 
Schuster  (1987)  used  the  first  order  correlation  integral  to  calculate  probability  densities  and  entropies. 
These  entropies  are  then  used  to  calculate  mutual  information.  We  consider  here  application  of  the 
technique  to  embedded  time  series  data,  Xij=(xi(.,Xi;^+Lag,Xi(.+2Lag> ^k+(m-i)Lag) 
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Yk  =  (yk.yk+Lag.yk+2Lag. . yk+(m-i)Lag)  k  =  1,  N-m+1.  Application  to  scalar  data  is  trivially 

obtained  by  taking  the  embedding  dimension,  m,  to  be  one  for  X  and  Y,  and  thus  dimension  two  for  the 
joint  space.  The  density  of  X  in  the  neighborhood  of  X^  is  approximated  by  the  first  order  correlation 
integral, 

Px,W  =  ^i;0(  r-|Xj-X^|  ), 

jVk 

where  ©  is  the  Heaviside  function,  N  y  is  the  number  of  embedding  vectors,  and  r  is  the  neighborhood 
size  being  considered.  This  density  differs  from  that  used  earlier  because  it  counts  the  number  of  points  in 
possibly  overlapping  neighborhoods.  The  densities  used  in  the  algorithms  discussed  earlier  involved  non¬ 
overlapping  neighborhoods  created  by  the  partitioning  process.  This  leads  to  a  slightly  different 
expression  for  the  entropy  which,  in  this  case,  is  given  by 

H(X,r)  =  -^|;inpx,(r) 
k=l 

In  some  implementations,  finite  sample  corrections  due  to  Grassberger  are  included  (Grassberger,  1988). 
The  entropies  of  the  Y  data  as  well  as  the  joint  entropy  are  calculated  similarly,  and  these  are  used  to 
obtain  the  mutual  information  from  the  relation  I(X,Y)=H(X)+H(Y)-H(X,Y). 

Quian  Quiroga  et  al.,  (2002)  used  the  Pawelzik-Schuster  algorithm  with  the  Grassberger 
corrections  in  a  study  of  synchronization  of  rat  electrocorticograms.  They  studied  three  multichannel 
ECoG  records  in  a  rat  model  of  genetic  absence  epilepsy  and  compared  activity  between  left  and  right 
hemispheres.  (The  data  they  analyzed  are  available  at  viWw.vis.caltech.edu/~rodri).  The  first  record,  their 
example  A,  was  obtained  in  an  interictal  condition  and  the  remaining  two,  their  examples  B  and  C,  were 
recorded  during  seizures  and  presented  repetitive  spike  discharges.  In  addition  to  mutual  information,  the 
synchronization  measures  examined  included  nonlinear  interdependencies,  phase  synchronizations,  cross 
correlation,  and  the  coherence  function.  They  concluded  that  except  for  mutual  information  their  linear 
and  nonlinear  measures  provided  qualitatively  similar  results.  Namely,  interhemispheric  synchronization 
was  highest  in  example  B,  followed  by  A  and  then  C.  The  authors  felt  that  the  small  number  of  data 
points  (N=1000)  was  responsible  for  the  failure  of  mutual  information  to  provide  robust  estimates  of 
interhemispheric  synchronization. 

These  data  were  re-analyzed  by  Duckrow  and  Albano  (2003)  using  a  modified  Fraser-Swinney 
algorithm.  The  data  were  embedded  and  interleaved  as  described  in  Appendix  3  and  the  resulting  binary 
representations  were  used  as  inputs  in  the  Fraser  Swinney  algorithm.  Using  embedding  dimensions  from 
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1  to  10  and  Lags  from  1  to  30,  the  results  consistently  showed  the  B  >  A  >  C  ranking  that  Quian  Quiroga 
et  al.  found  using  other  measures  of  synchronization.  Results  obtained  by  Duckrow  and  Albano  using 
these  data  and  a  uniform  partition  algorithm  showed  a  behavior  similar  to  that  found  by  Quian  Quiroga 
when  they  used  the  Pawelzik-Schuster  algorithm.  The  expected  ranking  of  B  >A  >  C  was  seen  only  for 
embedding  dimensions  1  and  2,  and  the  value  of  mutual  information  increased  with  increasing  embedding 
dimension  (Figure  12).  This  is  likely  a  consequence  of  small  data  set  size.  An  examination  of  the  uniform 
partition  algorithm  indicates  why  this  should  be  the  case.  If  one  considers  X  and  Y  as  N  uniformly 
distributed  random  numbers,  the  individual  and  joint  probabilities  would  be  Px  =1/N,  Py  =1/N,  and 

PxY  =  1/N^  .  The  resulting  mutual  information  would  be  zero.  However,  when  sparse  data  are  scattered 
into  a  large  multidimensional  embedding  space  it  becomes  unlikely  that  more  than  one  point  is  present  in 
any  one  histogram  element.  In  this  case  the  joint  probability  is  closer  to  1/N  than  1/  N  The  resulting 
mutual  information  would  be  log(N).  When  a  fixed  bin-width  histogram  method  was  applied  to  sets  of 
N=1000  embedded  uniformly  distributed  random  numbers,  Duekrow  and  Albano  found  that  the  resulting 
average  mutual  information  value  rose  quickly  with  increasing  embedding  dimension  and  rapidly 
approached  log(N),  as  shown  by  the  dash-dot  line  in  Fig.  12. 


Fixed  bin-width  histogram  method 


Figure  12.  Average  mutual  information  at  increasing  embedding  dimension  with  fixed  Lag  =  10  (From 
Duckrow  and  Albano,  2003). 

In  contrast,  Albano  and  Duckrow  found  that  the  average  mutual  information  of  same  the  random 
numbers  calculated  with  the  Fraser-Swinney  algorithm  had  a  median  value  of  1.1  x  10'^  and  did  not 
exceed  0.02.  Interleaving  data  embedded  in  m  dimensions  represents  m-dimensional  vectors  as  scalars 
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and  reduces  the  calculation  of  the  joint  probability  in  a  2m-dimensional  space  to  a  two-dimensional 
calculation.  Increasing  the  embedding  dimension  does  not  make  the  embedded  points  any  sparser  in  the 
reduced  space,  and,  therefore,  calculation  of  the  mutual  information  proceeds  in  the  same  maimer  as  for 
calculation  for  unembedded  data.  One  does  not  get  the  artificial  increase  in  the  calculated  mutual 
information  that  results  from  scattering  a  limited  number  of  points  in  spaces  of  ever-increasing 
dimensions  observed  with  imiform  partition  algorithms.  In  subsequent  work,  Quian  Quiroga  et  al.  (2003) 
repeated  their  calculations  using  interleaved  embedded  data  as  inputs  to  the  Pawelzik-Schuster  algorithm 
and  confirmed  the  B  >  A  >  C  ranking  of  mutual  information  reported  by  Albano  and  Duckrow.  In  this 
contribution,  they  suggest  that  a  more  precise  method  for  calculating  the  mutual  information  may  be  one 
that  estimates  probability  densities  using  k-th  nearest  neighbor  distances  rather  than  the  number  of 
neighbors  in  neighborhoods  of  fixed  size. 

Yet  another  approach  to  calculating  mutual  information  has  been  published  by  Kilminster,  et  al. 
(2002)  who  have  shown  that  the  Radon  transform  can  be  used  to  estimate  joint  probability  density 
functions  which  can  then  be  used  to  estimate  mutual  information.  They  argue  that,  in  contrast  with 
standard  methods,  this  procedure  preserves  fractal  stmcture.  The  Kilminster  et  al.  and  the  Moon,  et  al. 
algorithms  could  be  compared  against  the  Fraser-Swinney  algorithm  and  the  algorithm  of  Section  IV  in 
an  expanded  study. 
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Appendix  1.  Mutual  Information:  Definition  and  Mathematical  Characterization 

A.  Define  a  system,  {xi,X2, . {Px(l),Px(2), . Px(Nx)} 

B.  Define  the  information  in  the  i-th  S3mibol,  I(i) 

C.  Define  the  entropy  of  a  system,  H(X) 

D.  Define  the  joint  probability  distribution,  Pxy  (i,j) 

E.  Define  the  conditional  probability  distribution,  Px|Y  (b  j) 

F.  Define  the  joint  entropy,  H(X,Y) 

G.  Define  the  conditional  entropy  H(X|Y) 

H.  Define  mutual  information  I(X,Y) 

A.  Define  a  system,  {xi,X2, . x^^},  {Px(l),Px(2), . Px(Nx)} 

X  is  a  system  consisting  of  a  discrete  set  of  possible  s3nnbols,  {X],X2, . The  set  of 

possible  symbols,  which  is  often  referred  to  as  the  alphabet,  is  to  be  distinguished  from  an  output 
sequence,  or  message,  generated  by  system  X.  An  output  sequence  is  an  ordered  sequence  of  symbols 
drawn  from  the  symbol  set.  Throughout  we  use  Nx  to  denote  the  number  of  elements  in  the  symbol  set 
and  Nj) ,  where  ‘D’  denotes  data,  to  denote  the  length  of  a  message.  The  associated  probabilities  of  each 

element  of  the  symbol  set  is  given  by  {Px(l),Px(2), . Px(Nx)}  >  where  probabilities  have  the  property 

Nx 

EPx(i)  =  l 

i=l 

In  many  of  the  applications  implemented  here,  symbol  Xj  denotes  the  presence  of  an  event  in  the  i-th 
element  of  a  histogram  that  is  composed  of  Nx  bins.  In  this  context,  Px(i)  is  the  occupation  probability 
of  the  i-th  bin. 

B.  Define  the  information  in  a  symbol,  I(i) 

The  information  content  of  the  i-th  symbol  is 

I(i)  =  -log2Px(i) 

Throughout,  all  logarithms  are  computed  in  base  2,  and  the  resulting  values  are  reported  in  bits.  (In  some 
texts  the  natural  logarithm  is  used  and  the  reported  units  are  “nats.”)  Suppose  the  probability  of  a  s3mibol 
is  1 .  In  this  case,  nothing  is  learned  by  observing  it.  The  corresponding  information  is  -  log2  (1)  =  0  .  As  a 
symbol  become  increasingly  improbable,  its  associated  information  increases. 
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C.  Define  the  entropy  of  a  system,  H(X) 

Information  is  a  property  of  a  symbol.  In  contrast,  entropy  is  a  property  of  a  system.  Entropy, 
H(X),  is  the  average  amount  of  information  gained  from  an  observation  of  x.  Restated,  H(X)  is  the 
average  uncertainty  in  x  prior  to  its  observation. 

Nx 

H(x)=2;Px(i)i(i) 

i=l 

Nx 

H(X)  =  -£Px(i)log2Px(i) 

i=l 

This  can  be  generalized  to  a  continuous  variable 

H(X)  =  -JPx(x)log2Px(x)dx 

In  the  discrete  case,  suppose  an  observation  can  be  in  one  of  sixty-four  bins.  Thus  there  are  sixty-four 

possible  s3mbols,  {xj, . Xg4}.  Suppose  further  that  the  probability  of  any  given  symbol  is  the  same, 

P(i)  =  1/64  for  all  i. 


64  ^  f  I  \ 

H(X)  =  -gPx(i)log2  Px(i)  -  ^  2®  =  6  bits 


Similarly,  if  there  were  128  equiprobable  symbols,  II(X)=7  bits. 


For  the  general  case,  using  the  convexity  of  xlogx,  it  can  be  shown  that  the  maximum  value  of 
entropy  is  obtained  when  all  symbols  have  the  same  probability.  Additionally,  using  a  series  expansion  of 
logx,  it  can  be  shown  that 

lim  xlogx  =  0 
x->0 

Therefore,  as  previously  asserted,  if  Px  (j)  =  1  and  Px  (k)  =  0  for  all  k  j ,  then  H(X)=0. 


D.  Define  the  joint  probability  distribution,  Pxy  (i>  j) 

Consider  two  systems,  X  with  the  output  sequence  {xi,X2, . Xj^^^ }  and  system  Y  presenting 

the  message  {yi,y2, . yN^}  •  When  defining  a  joint  probability  distribution,  one  must  emphasize  the 

previously  made  distinction  between  Nx  the  number  of  different  symbols  that  can  be  presented  by 
system  X,  Ny  the  number  of  distinct  symbols  that  can  be  presented  by  system  Y,  and  the  number  of 
observed  (x,y)  pairs.  As  before,  the  independent  probability  of  each  symbol  of  system  X  is  denoted  by 
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{Pj;;(l),Px(2), . ’PxCNx)}-  Similarly,  the  independent  probability  distribution  of  system  Y  is 

{Py  (1), Py  (2), . Py  (Ny)} .  The  joint  probability  distribution  P^yO,  j)  is  the  probability  that  an  (x,y) 

pair  consists  of  the  i-th  system  X  symbol  and  the  j-th  system  Y  symbol.  It  should  be  noted  that 
PxY(i>  j)  ~  Px  (^)Py  (j)  if  if  ^  Y  are  independent. 

Lemma:  Relationship  between  joint  probability  distributions  and  single  variable  distributions 

Ny 

Px(i)  “ 

j=l 

Demonstration: 

By  summing  over  all  possible  y  values,  the  probability  of  y,  whatever  value  it  might  be,  is  1 .  Therefore 
the  remaining  value  in  the  sum  is  the  probability  of  the  system  X  sjnnbol. 

E.  Define  the  conditional  probability  distribution,  Px|y  (i,  j) 

Given  systems  X  and  Y  defined  above,  the  conditional  probability  distribution  is  denoted  by 
Px|Y  ■  Px|Y(i>j)  ^he  probability  that  X  =  Xj  given  that  it  is  already  known  that  Y  =  yj.  The 

relationship  between  the  joint  probability  distribution  and  the  conditional  probability  distribution  is 
established  by  the  following  lemma. 

Lemma:  The  relationship  between  the  conditional  probability  distribution  and  joint  probability 
distribution  is  given  by: 

PxY  (h  j)  ~  Px|Y  (h  j)PY  (j) 

Demonstration: 

Py  (j)  is  the  probability  that  Y  =  y j .  Px|Y  (i,  j)  is  the  probability  that  X  =  Xj  if  it  is  already  known  that 

Y  =  yj.  Therefore  the  product  of  these  probabilities  is  the  probability  that  both  X  =  Xi  and  Y  =  yj, 
which  by  definition  is  Pxy  (i,  j) . 

If  X  and  Y  are  independent,  then  Pxy  (h  j)  =  Px  (i)PY  ( J)  >  and  Px  (i)  =  Px|y  (h  j)  ■  That  is,  if  X  and 

Y  are  independent,  then  the  probability  that  X  =  Xj  is  determined  solely  hy  system  X. 

F.  Define  the  joint  entropy,  H(X,Y) 

Given  X  and  Y  systems  as  previously  defined,  the  joint  entropy  is  defined  as: 
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Nx  Ny 

H(X,Y)  = 

(i>j)log2  PxyO’J) 

i=i  j=i 

H(X,Y)  is  the  average  amount  of  information  gained  by  observing  an  (x,y)  pair. 
Lemma:  The  joint  entropy  of  a  system  with  itself  is  given  by 

H(X,X)=H(X) 

Demonstration: 

By  definition 

Nx  Ny 

H(X, Y)  -  -25]PxY(iJ)log2  PxY(h j) 

i=l  j=l 

If  Y=X,  then  PxyOj  j)  -  SjjPxCO  where  dy  is  Rronecker’s  delta 

Nx  Ny 

H(X,  Y)  =  -  JXSijPxWloga  5ijPx(i) 

i=i  j=i 

limzlogz  =  0 
z—>0 

Hence 

Nx 

H(X,X)  -  -£Px(i)log2  Px(i)  =  H(X) 


Lemma:  The  joint  entropy  function  is  symmetric,  that  is 

H(X,Y)=H(Y,X) 

Demonstration: 

By  definition 

Nx  Ny 

H(X,Y)  - 

(ij  j)log2  PxvCh  j) 

i=l  j=l 

The  joint  probability  distribution  is  not  a  conditional  probability  distribution;  Pxy  (i,j)-PYx(j,i). 
Therefore: 

H(X,Y)=H(Y,X) 


G.  Define  the  conditional  entropy  H(X|Y) 

For  a  given  value  of  Y,  say  Y  =  yj ,  the  conditional  entropy  is  defined  as 

Nx 

H(X  I  j)  =  -£Px|Y(i,  j)log2  PxiyO.  j) 

i=l 
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H(X  I  j)  is  the  average  amount  of  information  obtained  by  observing  X  when  Y  =  yj .  H(X)Y),  the 

average  conditional  entropy,  is  H(X  |  j)  averaged  over  all  possible  y’s. 

Ny 

H(X|Y)  =  ;^PY(j)H(X|j) 
j=i 

H(X|Y)  is  the  average  information  obtained  by  observing  X  after  Y  is  known.  Said  another  way,  H(X|Y) 
is  the  average  number  of  additional  bits  required  to  specify  X  if  Y  is  known. 


Lemma:  Relationship  between  conditional  entropy  and  joint  entropy  is  given  by 

H(X|Y)=H(X,Y)-H(Y) 

Demonstration: 

By  definition 

Ny 

H(X|Y)  =  2Py(j)H(X|j) 
j=i 


where 


Nx 

H(X  I  j)  -  -£Px|Y(i,  j)log2  Px|Y(i,  j) 


i=l 


It  has  been  demonstrated  previously  that 


Therefore 


^’x|Y  (b  j)PY  (j)  ~  ^XY  (b  j) 


H(x  I Y) = y  PY(j)(-i)y  ^^YMiog2 

tr  PyG)  PyG) 


Nx  Ny 


H(X  I  Y)  =  -ZE  PxyO’ j) 


i=l  j=l 


PxY(bj) 

PyG) 


Nx  Ny  Nx  Ny 

H(X  I  Y)  = 

(bj)log2PxY(bj)  +  ZZ^XY  (i,j)l0g2PYG) 
i=l  j=l  i=l  j=l 

The  first  term  on  the  right  hand  side  is  the  joint  entropy. 

NyfNx  ] 

H(X|Y)  =  H(X,Y)  +  zz  PxY  (bj)fl0g2PYG) 


It  was  previously  shown  that 


j=i  Li=i 


Nx 

X^XY(bj)  =  PYG) 

i=l 
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Hence 


Ny 

H(X  I  Y)  =  H(X,  Y)  +  2PY(j)log2  P(j) 
j=i 

H(X1Y)=H(X,Y)-H(Y) 


H.  Define  mutual  information  I(X,Y) 

The  average  amount  of  information  obtained  by  observing  X  can  be  conceptualized  as  consisting 
of  two  components. 


Average  amount  of  information  obtained  by  an  observation  of  X 

=  Average  amount  of  information  about  X  obtained  by  observing  Y 

+  Average  amount  of  information  about  X  obtained  by  observing  X  after  Y  is  known 


Two  of  these  elements  have  already  been  defined. 

H(X)  =  Average  amount  of  information  obtained  by  an  observation  of  X 
H(X|Y)  =  Average  amount  of  information  about  X  obtained  by  observing  X  after  Y  is  known 
The  third  element  is  defined  as  the  mutual  information 

I(X,Y)  =  Average  amount  of  information  about  X  obtained  by  observing  Y 

H(X)  =  I(X,Y)+H(X|Y) 

I(X,Y)=H(X)-H(X|Y) 

Shannon  (1948)  used  the  phrase  “rate  of  transmission”  for  this  quantity 


Properties  of  Mutual  Information 

Lemma  1.  H(X)=I(X,X),  referred  to  as  the  self-information 

Lemma  2. 1(X,Y)=H(X)+H(Y)-H(X,Y),  note  that  this  is  the  joint  entropy,  not  the  conditional  entropy 
Lemma  3. 1(X,Y)=I(Y,X),  mutual  information  is  symmetric 
Lemma  4. 1(X,Y)=H(Y)-H(Y|X) 

NxNy  f  p  /j  j\  j 

Lemma  5.  I(X,Y)  =  )  j)Iog2 1  — ~  - — rr  This  is  the  expression  that  will  be  used  in  most  of 

i=i  j=i  [PxCOPyOJ 


the  computations. 

Lemma  6. 1(X,Y)=0  if  X  and  Y  are  independent 
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Lemma  1.  The  relationship  between  entropy  H(X)  and  self-information  I(X,X)  is  given  by 

H(X)=I(X,X) 

Demonstration: 

By  definition 

I(X,Y)=H(X)-H(X|Y) 

It  was  previously  demonstrated  that 

H(X|Y)=H(X,Y)-H(Y),  and 
H(X,X)=H(X) 

Hence 

I(X,X)  =H(X)-H(XiX) 

=  H(X)-{H(X,X)-H(X)} 

=  H(X)-{H(X)-H(X)} 

=  H(X) 


Lemma  2.  Mutual  information  I(X,Y)  can  be  expressed  in  terms  of  entropies  H(X),  H(Y)  and  joint 
entropy  H(X,Y)  by 

I(X,Y)=H(X)+H(Y)-H(X,Y) 

Demonstration: 

By  definition 


From  a  previous  result 


Therefore: 


I(X,Y)=H(X)-H(X|Y) 

H(X|Y)=H(X,Y)-H(Y) 


I(X,Y)=H(X)+H(Y)-H(X,Y) 


Lemma  3.  Mutual  information  is  symmetrical,  that  is 

I(X,Y)=I(Y,X) 

Demonstration: 

By  Lemma  2, 


Therefore: 


I(X,Y)=H(X)+H(Y)-H(X,Y) 
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I(Y,X)=H(Y)+H(X)-H(Y,X) 

It  was  previously  shown  that  the  joint  entropy  is  symmetric.  This  gives 

I(Y,X)=H(Y)+H(X)-H(X,Y)=I(X,Y) 


Lemma  4.  The  relationship  between  mutual  information  and  joint  entropy  is  given  by 

I(X,Y)=H(Y)-H(Y|X) 

By  Lemma  2,  the  left  hand  side  of  the  equation  is: 

LHS=H(X)+H(Y)-H(X,Y) 


By  a  previous  lemma. 


H(Y|X)=H(Y,X)-H(X) 

Using  this  and  the  symmetry  of  the  joint  entropy  gives  the  following  expression  for  the  right  hand  side  of 
the  equation 


RHS=H(Y)-H(H|X) 

=H(Y)-{H(X,Y)-H(X)} 


=H(X)+H(Y)-H(X,Y)=I(X,Y) 


Lemma  5.  Computational  Expression:  Mutual  information  can  be  expressed  in  terms  of  probability  and 
joint  probability  distributions  by 


Nx  Ny 

I(X,Y)  = 

(iJ)log2 


i=l  j=l 


PxvChj)  ] 

lPx(i)PY(j)J 


Demonstration: 

Expanding  the  right  hand  side  of  this  expression  gives: 

Nx  Ny 

^  ^  P XY  (i;  j)  log2  PxvCh  j) 

i=l  j=l 


Nx  Ny 

“  ^  ^  PxY  (i>  j)  log  2  Px  0) 


i=l  j=l 


Nx  Ny 

“^^PxY(i>j)log2  PyO 

i=l  j=l 

Using  the  definition  of  the  joint  entropy  and  rearranging  terms  gives 

RHS=H(X,Y) 

Nx  Ny 

“  ^  log2  Px  (i)^  PxY  (h  j) 

i=l  j=l 
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It  was  previously  argued  that 


Ny  Nx 

■X*°S2PY(j)X^XY(i>j) 

j=l  i=l 


Ny 

Px(i)  =  i;PxY(iJ) 

j=i 

Nx 

PY(j)  =  SPxY(i.j) 

i=l 


Therefore 


Ny 

RHS  =  -H(X,  Y)  -2;Px(i)log2  Px(i)-2PY(j)log2  PyQ) 

i=i  j=i 

RHS=H(X)+H(Y)-H(X,Y)=I(X,Y) 


Lemma  6.  If  X  and  Y  are  statistically  independent,  then  I(X,Y)=0. 


Demonstration; 

It  has  previously  been  argued  that  PxY  (h  j)  =  Px  (i)PY  ( j)  if  ^  Y  are  statistically  independent.  From 


Lemma  5,  we  have 


If  X  and  Y  are  independent,  the  argument  of  log  2  is  identically  one,  and  I(X,Y)=0. 
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Appendix  2.  Jointly  Gaussian  data  sets  and  the  mutual  information  of  jointly  Gaussian  data  set 
pairs 


We  construct  here  a  procedure  for  generating  jointly  Gaussian  data  sets  (Y'}  and  {Y^}  from  two 

independent  Gaussian  data  sets  {X^}  and  {X^} .  This  is  followed  by  a  demonstration  showing  that  the 
mutual  information  of  two  jointly  Gaussian  data  sets  with  a  cross-correlation  coefficient  r  is 
-0.51og(l-r^). 


For  simplicity  of  presentation  we  consider  the  special  case  of  data  sets  that  have  zero  mean  and 
equal  variance.  The  procedure  can  be  extended  to  the  more  general  case.  Data  sets  with  these  limitations 
are,  however,  sufficient  if  the  investigation  is  limited  to  comparison  tests  of  mutual  information 

algorithms.  Let  {X*}  =  (x{,X2,X3, . x^)  and  {X^}  =  (xf,X2,xf, . x^)  be  Gaussian  distributed 

with  zero  mean  and  the  same  variance  .  It  is  further  assumed  that  they  are  uneorrelated,  that  is,  their 
cross-correlation  coefficient  r  is  equal  to  zero.  Given  the  assumption  of  zero  correlation,  their  joint 
probability  distribution  is  the  product  of  their  individual  probability  distributions. 


Ina 


-exp{-[( 


(x*)^  +(x2)^ 


/ 


1 


27i|EJ 


1/2 


exp^x^SxW^l 


1  9 

where  is  the  (X  ,X  )  covariance  matrix. 


y  o' 

Two  data  sets  {Y*}  =  (y},y2,yL . Yn)  and  {Y^}  =  (yf,y2,y3. . Yn)  with  zero  means,  equal 

variance  and  cross-correlation  r  are  jointly  Gaussian  if  their  joint  probability  density  function  is 


PYiY2(y^y^)=-  ^ 


27i|ZJ 


1/2 


exp 


\ryy~yllA 


1  9 

Zy  is  the  (Y  ,Y  )  covariance  matrix. 


^1  r^ 


y-'= - i 


(I_r2)a2 

1  9 

Matrix  A  is  a  two-dimensional  linear  transformation  relating  (X  }  and  {X  } ,  independent  Gaussian 
random  variables,  to  {Y*}  and  {Y^},  jointly  distributed  Gaussian  variables. 


1  -r 
-r  1 
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By  construction 


2 

=  A 

yj 

[yJ 

Using  the  expression  for  and  re-expressing  x  in  terms  of  Ay  gives 

x'^x  =  yVAy  =  a^/XYZ 

Let  A  be  given  by 

A  =  | 


^a 


yC  dy 


Using  this  representation  for  A  and  the  expression  for  above  gives 


A'^A: 


^a^+c^  ab  +  cd'^ 

-  ^ 

f' 

“"1 

^ab  +  cd  b^+d^^ 

l-r 

1. 

Solving  for  b,  c  and  d  in  terms  of  a  and  r  gives 

a  -ar  +  -y/l-a^(l-r^) 


A: 


jl-a^a-r^)  Jl-a^(l-r^)  ^ 


I  (i_r2)  y  (l_rO 

There  are  an  infinity  of  A’s  that  depend  on  the  choice  of  a.  We  use  here  the  simplest  case,  a=l. 


r  1 


A  = 


0 

-1 


Vl-r^  Vl-r^ 


A”‘  - 


1 


0 


r 


In  the  next  step,  we  need  to  establish  the  relationship  cited  in  the  text  between  mutual  information 
I(Y’,Y^)  and  r,  the  cross-correlation  coefficient.  In  this  derivation,  we  use  the  property  that  {Y'}  and 
{Y^}  are  jointly  distributed,  have  correlation  r,  and  are  related  to  independent  Gaussian  data  sets  {X*} 

and  {X^}  by  linear  transformation  A.  The  derivation  begins  with  the  integral  representation  for  mutual 
information  expressed  in  terms  of  the  joint  and  individual  probability  density  functions.  The  integrals  are 
taken  fi'om  -  oo  to  +  oo  . 
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i(y\y2)=  JjpY,Y2(y\y')iog- 


IdyMy^ 


^Py.(y)PY2(y') 


By  construction,  Y*  and  Y^  are  jointly  Gaussian.  Therefore, 


PYiY2(y‘>y^)= 


1 


27t|EY  I 


1/2 


Ey  is  the  Y* ,  Y^  covariance  matrix  where  as  before  equal  variances  are  assumed. 


Ev-^ 

1  9 

Y  and  Y  are  Gaussian  distributed. 


^1  r^ 

h 


„  1/2  -  - 
=aHl-r^) 


2x1/2 


PY.(y‘)=- 


-(y‘)2/2a^ 


PY2(y')=- 


-(y2)2/2cr2 


fi 


na 


This  gives  the  following  expression  for  mutual  information 


I(Y^Y^)  = 


B 


/S;‘y/2 


271 1  Ey  I 


1/2 


log 


e  - 


27t  I  Ey  I 


1/2 


s 


7ta 


dyMy 


1  /9 

Given  the  previously  stated  expression  for  |  Zy  |  ,  the  expression  for  mutual  information  simplifies  to 


I(Y*,Y2): 


J  J27ca^(l-r 


_r2)i/2 


log 


-(y')2/2a2g-(y2)^/2a2(l_r2y/2 


dyMy^ 


1  9 

This  can  be  re-expressed  as  an  integral  in  x  and  x  . 


y  =  A“‘x 


fxO 

r  x'  ] 

[r  -Vr:7j 

^rx*  -Vl-r^x^^ 

dy'dy^  = 


5(/,y') 


d(x\x^) 


dxMx^ 


where  the  right  hand  side  of  the  last  equation  is  the  absolute  value  of  the  determinant  of  the  Jacobian  of 
the  transformation. 
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dyMy^ 


1  0 


dx'dx^  =  Vl-r^ dx^dx^ 


By  consti'uction  Y  =  •  This  gives 


I(Y’,Y^): 


^-x^x/2cj^ 


-x'^x/2a^ 


g-(y')2  /2a2  g-(y2)2 /aa^  _  j.2 )l/2 


dxMx^ 


Taking  logarithms  of  the  exponentials  gives 


-x’^x/2ct^ 


I(Y\Y0  = 


2na^  L2a^ 


V  (-  ^ + (y^  )^ + (y  ^  )^ )-  log  Vi^jdxMx^ 


19  9  9 

An  expression  for  (y  )  +  (y  )  is  constructed  from  the  defining  linear  relationship  between  x’s  and  y’s. 


y  =  A  ^x 


M  n  0 


y~J  [T  -Vl-r‘Ax‘;  lrx‘-Vl-r"x 


2  v2  „,i 


(y’)^  +(Y^)^  +  {rx'  -Vl-r^x^}^  =(x')^  +r^(x^)^  -2rVfr^x’x^  +(l-r^)(x^)^ 

This  expression  in  the  integral  becomes 

-x'^x  +  (y*)^  +(y2)^  --(x')^  -(x^)^  +(x')2  +r2(x^)2  -2r^]l-r^x^x^  +0.-r^)(x^f 

^T^(x^f  -T^(x^f  -2rVl-r^x^x^ 

The  integral  for  mutual  information  becomes 


I(Y\y2)=  ^ - 


x’^\/2c^ 


2na^  l2a^ 


^fr^(x^)^  -r^(x^)^  - 2rVl  -r^ x'x^  V log Vl  - r^  IdxMx^ 


Consider  the  integral 


-X^x/2ct2 


27ia^  l2a^ 


i^(r2(xi)2-r2(x2)2)Wdx2 


The  two  terms  are  of  equal  magnitude  and  opposite  sign,  and  the  double  integral  is  therefore  equal  to 
zero.  Similarly  consider 


^”X^x/2a^ 


2na^  [Ig'^ 


-r^x^x^  Udx^dx^ 


Each  integral  is  of  an  odd  function  over  the  range  -  co  to  +  co  and  is  therefore  equal  to  zero.  The  integral 
for  mutual  information  simplifies  to 
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I(Y\y2)  =  - 


2  ^  dx'dx^ 


Using 


gives 


I(Y‘  .  )  =  -  log  VT^  =  -ilog(l  -  ) 
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Appendix  3.  Binary  representation  of  XY  partitioning  and  generalization  to  embedded 
data 


The  previous  section  provided  details  of  the  local  adaptive  partitioning  used  by  Fraser  and 
Swinney  to  calculate  mutual  information.  The  space  being  partitioned  is  that  of  the  joint  distribution  of 
X  =  {xi,X2,-"Xn}  and  Y  =  {yi,y2,-“yN}>  ^  subset  of  the  XY  plane  which  may  be  considered  a  two- 
dimensional  embedding  space  whose  elements  are  (Xj,yj) ,  i=l,2,....N.  The  following  steps  are  used  to 
implement  the  procedure: 

1 .  Let  the  number  of  elements  of  both  X  and  Y  be  N  =  2"  (the  binary  logic  of  the  algorithm  requires 
N  =  2"). 

2.  Rank  order  both  X  and  Y  with  no  repeated  elements  so  that  they  both  map  to  permutations  of  the 

integers  0,  1,  2"-l.  To  avoid  repeated  elements,  one  may  assign  higher  ranks  to  numbers  appearing 

earlier  in  the  series.  Call  these  rank-ordered  lists  X^  =  {xf  ,xf  .•■■x^^}  and  Y^  =  {yf  jYa  "’yN} •  ^ 
and  Y^  are  equiprobable. 

3.  Transform  the  elements  of  X*^  to  binary.  Since  theO  <  x^  <  2”  - 1 ,  these  binary  representations  have  at 

most  n  bits  -  i.e.,  x^  Here,  a^~^  is  the  most  significant  bit  of  x^,  ajj“^  the  second 

most  significant,  etc.  Perform  the  same  transformation  on  the  elements  of  Y^  to  get  y^  =  •  •  ’bk  • 

4.  Interleave  the  bits  of  x^  and  y^  to  get 

z^=(a">r'ar'br'-aK).  (1) 

The  two  leftmost  elements  of  are  the  most  significant  bits  of  x^  and  y^  ,  respectively,  the  next  two 
are  the  next  most  significant  bits,  etc.  For  example,  suppose  (x^  ,yk  )  =  (5,47)  .  Then,  using  the  binary 
representations,  5  =  000101  and  47=101111,  the  interleaved  representation  of  (Xk,yk)  is 

(Xk,yk)=^z^  =(010001110111). 

A  crucial  advantage  of  this  representation  derives  from  the  observation  that  the  successive  bit 
pairs  provide  a  tree  representation  for  the  location  of  (x^  ,yk  )  i^^  ibe  two-dimensional  embedding  space. 
To  see  this,  label  the  axes  of  a  two-dimensional  embedding  space  by  x  and  y  and  consider  the  region 
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0<x,y<2^  -1.  If  this  region  is  subdivided  into  4  quadrants  as  in  Figure  13a,  then  the  bottom-left 
quadrant  contains  all  those  vectors  with  six-bit  x's  whose  most  significant  bits  are  0  and  with  y's  whose 
most  significant  bits  are  also  zero,  the  bottom-right  quadrant  contains  all  those  x's  whose  most  significant 
bits  are  1  and  those  y’s  whose  most  significant  bits  are  0,  etc.  The  location  of  any  interleaved  point  in  this 

subdivision  is  thus  labeled  by  its  first  two  elements;  the  (x^,  y^ )  in  our  example  is  in  quadrant  01.  If  this 

quadrant  is  again  subdivided  into  four,  the  next  two  bits  of  specify  its  location  in  the  new  subdivision 
(Figure  13  b),  and  so  on. 
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Figure  13.  (a)  Partition  of  0  <  x,y  <  2^-1  into  4  quadrants,  (b)  Partition  of  quadrant  01  (upper  left)  into 
four  sub-quadrants. 


The  technique  of  interleaving  may  also  be  used  to  implement  time-delay  embedding.  Consider  the 
m-dimensional  embedding  of  X  with  a  specified  lag 

^  “  (^k  ’  ^k+Lag  ’  ^k+2Lg  > . ^k+(m-l)Lag  ) 
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Using  the  notation  of  Equation.  (1),  the  m-dimensional  embedding  vector,  Xj^ ,  may  be  represented  as 


X, 
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n-2 
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a  number  that  uniquely  represents  Xjc.  A  similar  embedding  and  interleaving  of  Y  gives 
Y  =  (y  ic ,  y  k+Lag  >  y k+2Lg  > . Y k+(m-l)Lag  )  and 


^''k  -(bk  ^bk+Lag  ■ 


*  *^k+(k-l)LagAOk  ^k+Lag  '  ‘  *  Ok4-(k-l)Lag  A 


•(b^^Lag-b  k+(m“l)Lag  ) 


The  interleaved  sets,  {Uk}  and  {Vk},  each  consists  of  2"  numbers,  each  number  specified  by  nxm  bits.  To 
calculate  the  mutual  information  of  X  and  Y  {Uk}  and  {Vk}  are  converted  to  decimal  and  used  as  inputs  in 
the  Fraser-Swinney  algorithm. 


55 


